Perfect and Maximum Randomness in Stratified Sampling over Joins

Authors

  • Niranjan Kamat
  • Arnab Nandi
Abstract

Supporting sampling in the presence of joins is an important problem in data analysis. Pushing down the sampling operator through both sides of the join is inherently challenging due to data skew and correlation issues between output tuples. Joining simple random samples of base relations typically leads to results that are non-random. Current solutions to this problem perform biased sampling of one (and not both) of the base relations to obtain a simple random sample. These techniques are not always practical since they may result in the sample size being greater than the size of the relations due to sample inflation, rendering sampling counter-productive. This paper presents a unified strategy towards sampling over joins, comprising two key contributions. First, in the case that perfect sampling is a requirement, we introduce techniques to generate a perfect random sample from both sides of a join. We show that the challenges faced in sampling over joins are ameliorated in the context of stratified random sampling as opposed to simple random sampling. We reduce the dependency of feasibility of sampling from relation level to strata level. Our technique minimizes the sample size while maintaining perfect randomness. Second, in the case that random sampling is not a requirement but is still preferred, we provide a novel sampling heuristic to maximize randomness of the join. It allows us to allocate a fixed sample size between multiple relations consisting of multiple strata to maximize the join randomness. We validate our techniques theoretically and empirically using synthetic datasets and a standard benchmark.
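
The observation that joining independent samples of the base relations does not behave like a sample of the join can be illustrated with a small simulation. This is a minimal sketch and not the paper's technique: the helper names (join_size, bernoulli_sample, mean_sampled_join_size), the toy key columns, and the use of Bernoulli sampling at a common rate are all assumptions made for illustration. Each join tuple survives only if both of its parent tuples are sampled, so sampling each side at rate p keeps roughly p squared of the join, and surviving tuples that share a parent are correlated.

```python
import random
from collections import Counter

def join_size(r_keys, s_keys):
    """Size of the equi-join of two relations, given only their join-key columns."""
    r_counts, s_counts = Counter(r_keys), Counter(s_keys)
    # Each key contributes (occurrences in R) * (occurrences in S) output tuples.
    return sum(n * s_counts[k] for k, n in r_counts.items())

def bernoulli_sample(keys, rate, rng):
    """Keep each tuple independently with probability `rate` (hypothetical helper)."""
    return [k for k in keys if rng.random() < rate]

def mean_sampled_join_size(r_keys, s_keys, rate, trials=2000, seed=0):
    """Average size of the join of two independently sampled relations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        r_sample = bernoulli_sample(r_keys, rate, rng)
        s_sample = bernoulli_sample(s_keys, rate, rng)
        total += join_size(r_sample, s_sample)
    return total / trials

# Toy skewed data (an assumption): key "a" is frequent, key "b" is rare.
r_keys = ["a"] * 50 + ["b"] * 2
s_keys = ["a"] * 50 + ["b"] * 2

full = join_size(r_keys, s_keys)
avg = mean_sampled_join_size(r_keys, s_keys, rate=0.1)
print(f"full join size: {full}")
print(f"average join size of two 10% samples: {avg:.2f}")
print(f"rate^2 * full join size: {0.1 * 0.1 * full:.2f}")
```

In this toy setting the rare key "b" contributes only 4 of the join tuples, so it is frequently absent from the sampled join altogether, which is the kind of skew effect that motivates treating each stratum separately.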

Similar resources

Combinatorial optimization with 2-joins

A 2-join is an edge cutset that naturally appears in decomposition of several classes of graphs closed under taking induced subgraphs, such as perfect graphs and claw-free graphs. In this paper we construct combinatorial polynomial time algorithms for finding a maximum weighted clique, a maximum weighted stable set and an optimal coloring for a class of perfect graphs decomposable by 2-joins: t...

Memory-Limited Execution of Windowed Stream Joins

We address the problem of computing approximate answers to continuous sliding-window joins over data streams when the available memory may be insufficient to keep the entire join state. One approximation scenario is to provide a maximum subset of the result, with the objective of losing as few result tuples as possible. An alternative scenario is to provide a random sample of the join result, e...

Random Sampling over Joins Revisited

Joins are expensive, especially on large data and/or multiple relations. One promising approach in mitigating their high costs is to just return a simple random sample of the full join results, which is sufficient for many tasks. Indeed, in as early as 1999, Chaudhuri et al. posed the problem of sampling over joins as a fundamental challenge in large database systems. They also pointed out a fu...

On the inverse maximum perfect matching problem under the bottleneck-type Hamming distance

Given an undirected network G(V,A,c) and a perfect matching M of G, the inverse maximum perfect matching problem consists of modifying minimally the elements of c so that M becomes a maximum perfect matching with respect to the modified vector. In this article, we consider the inverse problem when the modifications are measured by the weighted bottleneck-type Hamming distance. We propose an alg...

Outer Joins in a Deductive Database System

Outer joins are extended relational algebra operations intended to deal with unknown information represented with null values. This work shows an approach to embed both null values and outer join operations in the deductive database system DES (Datalog Educational System), which uses Datalog as a query language. This system also supports SQL, where views and queries are compiled to Datalog prog...

Journal:
  • CoRR

Volume: abs/1601.05118

Publication date: 2016